Abstract

The problem of clustering fingerprint vectors is an interesting problem in Computational
Biology that has been proposed in [S]. In this paper we show some improvements in closing
the gaps between the known lower bounds and upper bounds on the approximability of some
variants of the biological problem. Namely, we are able to prove that the problem is APX-hard
even when each fingerprint contains only two unknown positions. Moreover, we have
studied some variants of the original problem, and we give two 2-approximation algorithms
for the IECMV and OECMV problems when the number of unknown entries for each vector
is at most a constant.

1 Introduction

High-throughput approaches for the examination of microbial communities are becoming
increasingly important, especially after the oligonucleotide fingerprinting strategy has found wide
application, allowing the identification of thousands of cDNA clones. After the
rDNA clone libraries are constructed, the clones are classified by individual hybridization
experiments on DNA microarrays with a series of short DNA oligonucleotides into clone types or
operational taxonomic units (OTUs), where an OTU is a set of DNA clones sharing the same
set of oligonucleotides that have successfully hybridized. Once classified, the nucleotide sequence
of representative clones from each OTU can then be obtained by DNA sequencing to provide
phylogenetic descriptions of the microorganisms. One of the key features of this strategy is that
after a comprehensive database that correlates hybridization patterns with nucleotide sequence
data has been compiled, little additional rDNA clone sequencing will be required, resulting
in a significant reduction of cost and effort. The effectiveness of this general strategy has been
demonstrated in the biotechnology arena, where it is currently being used to screen and identify
millions of cDNA clones.

The oligonucleotide fingerprinting method is commonly used to study DNA clone libraries.
This method naturally leads to a combinatorial problem where for each oligonucleotide we are
given a fingerprint over the alphabet {0, 1, N}, where the values 1 and 0 mean, respectively, that
hybridization with a certain clone has or has not happened, while the value N stands for the
fact that we are unable to determine whether the hybridization has happened (typically this is due
to the fact that there are two control signals, and the values between those two control signals
mean that either result might have happened).

The combinatorial problem that naturally arises is called CMV. In this problem we are
given a set of fingerprints and we would like to change each N-symbol in the input fingerprints
into 0 or 1, so that the total number of distinct fingerprints (over {0, 1}) is minimized. Actually
we are not interested in the actual fingerprints over {0, 1}, but only in determining the clusters
of fingerprints.

Unfortunately the problem is NP-hard, therefore it is important to study whether some restrictions
become tractable. For instance, it is possible to restrict the problem to instances where each
input fingerprint contains at most p N-symbols, and we will call this problem CMV(p). It is
already known that CMV(2) is NP-hard, while CMV(1) can be solved in polynomial time,
so for all interesting values of p we have to concentrate on developing approximation algorithms.
CMV(p) is known to be approximable within factor $2^p$ and $\min(1 + \ln n, 2 + p \ln l)$, where
l is the length of the fingerprint vectors. In this paper we strengthen the NP-hardness result
by proving that CMV(2) is APX-hard, that is, it cannot be approximated within an arbitrarily
small (1 + ε)-factor by a polynomial-time algorithm unless P=NP [2].

Moreover, we will study two related optimization problems, namely IECMV and OECMV,
where we want to maximize the number of pairs of compatible fingerprints that are
clustered together and to minimize the number of pairs of compatible fingerprints that are not clustered together,
respectively. Again we are interested in the restrictions of IECMV and OECMV with at most p
missing values in each fingerprint (those problems are denoted by IECMV(p) and OECMV(p)
respectively). The IECMV(p) problem is known to be approximable within factor 2^^~^ for any
p = O(log n), while the restriction of OECMV where no two compatible fingerprint vectors
can have value N at the same position can be approximated within factor 2(1 — ^).

In this paper we improve those approximation results, proving that both the IECMV(p) and
OECMV(p) problems are APX-hard, and we show that a simple greedy algorithm achieves a
2-approximation ratio for both problems.

2 Preliminary Definitions 

In this section, we introduce some basic notations and definitions that we will need later. A
fingerprint vector (in short fingerprint) is a vector over the alphabet {0, 1, N}, where 1 represents
a hybridization, 0 represents no hybridization and N represents unknown data (that is, we are
unable to determine if hybridization has happened or not). In all instances of the problems that
we will study, all fingerprints have the same length, that is, they contain the same number of
elements. Usually we will denote by l the length of a fingerprint.

Two fingerprint vectors $v_1 = (v_1[1], \ldots, v_1[l])$ and $v_2 = (v_2[1], \ldots, v_2[l])$ are compatible
if for any position i where they differ, at least one of $v_1[i]$ and $v_2[i]$ is equal to N. A resolved
vector $r = (r[1], \ldots, r[l])$ of a fingerprint vector $v = (v[1], \ldots, v[l])$ is a vector over the alphabet
{0, 1} such that for each $1 \le i \le l$, if $v[i] \ne N$ then $v[i] = r[i]$, that is, r and v agree on each
position where v is not unknown. Sometimes it is useful to analyze the effect of a parameter,
the maximum number of Ns allowed in a fingerprint; we will denote this parameter by p. We
are now ready to present the problems we will study.
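The two definitions above translate almost literally into code. The following is a minimal sketch, assuming fingerprints are represented as Python strings over the characters `0`, `1` and `N`; the function names `compatible` and `resolutions` are illustrative and do not come from the paper.

```python
from itertools import product

def compatible(v1: str, v2: str) -> bool:
    """True if v1 and v2 agree on every position where neither of them is 'N'."""
    return all(a == b or a == "N" or b == "N" for a, b in zip(v1, v2))

def resolutions(v: str) -> list[str]:
    """All resolved vectors of v: every 'N' is replaced by 0 or 1, other positions are kept."""
    choices = [("0", "1") if c == "N" else (c,) for c in v]
    return ["".join(bits) for bits in product(*choices)]

print(compatible("0N1", "001"))   # True: the only differing position holds an N
print(compatible("0N1", "1N1"))   # False: position 1 is 0 in one vector and 1 in the other
print(resolutions("0N"))          # ['00', '01']
```

A fingerprint with at most p Ns has at most 2^p resolutions, which is what keeps the enumeration cheap when p is a constant.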

Clustering with p missing values (CMV(p)): We are given a set F of fingerprint vectors
with at most p Ns and we want to partition F into disjoint subsets $F_1, \ldots, F_k$ such that any
two vectors in the same $F_i$ are compatible and the cardinality of the partition is minimized.

Inside Clustering with p missing values (IECMV(p)): We are given a set F of fingerprint
vectors with at most p Ns and we want to partition F into disjoint subsets $F_1, \ldots, F_k$ such
that any two vectors in the same $F_i$ are compatible and the number of compatible pairs of vectors within
the same cluster is maximized.

Outside Clustering with p missing values (OECMV(p)): We are given a set F of
fingerprint vectors with at most p Ns and we want to partition F into disjoint subsets $F_1, \ldots, F_k$
such that any two vectors in the same $F_i$ are compatible and the number of compatible pairs of vectors
belonging to different clusters is minimized.

Notice that for all the aforementioned problems, the instance is a set F of fingerprints and
the output is a partition of F in which every set contains only pairwise compatible
fingerprints. It is easy to see that pairwise compatibility is a sufficient condition for
the existence of a common resolution for all fingerprints in the set. For simplicity's sake, in the
following we will denote by n the number of fingerprints in an instance F.
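To make the three objectives concrete, here is a small sketch that evaluates a given partition under each of them and builds a common resolution for a cluster. It assumes fingerprints are strings over `0`, `1`, `N` and that the partition passed in is valid (every cluster pairwise compatible); the helper names and the four-fingerprint instance are hypothetical.

```python
from itertools import combinations

def compatible(v1, v2):
    return all(a == b or "N" in (a, b) for a, b in zip(v1, v2))

def common_resolution(cluster):
    """Return one resolved vector shared by all fingerprints in the cluster, or None."""
    out = []
    for column in zip(*cluster):
        known = set(column) - {"N"}
        if len(known) > 1:          # a 0 and a 1 in the same position: no common resolution
            return None
        out.append(known.pop() if known else "0")   # all N: either value works, pick 0
    return "".join(out)

def objective_values(partition):
    """Values of a valid partition under CMV, IECMV and OECMV."""
    fingerprints = [v for cluster in partition for v in cluster]
    total = sum(compatible(a, b) for a, b in combinations(fingerprints, 2))
    # every pair inside a cluster is compatible because the partition is assumed valid
    inside = sum(len(c) * (len(c) - 1) // 2 for c in partition)
    return {"CMV": len(partition), "IECMV": inside, "OECMV": total - inside}

partition = [["0N", "N0"], ["01"], ["10"]]
print(objective_values(partition))        # {'CMV': 3, 'IECMV': 1, 'OECMV': 2}
print(common_resolution(["0N", "N0"]))    # 00
```

Note that the IECMV and OECMV values of the same partition always sum to the total number of compatible pairs in F, which is why the two problems share optimal solutions but differ in approximability.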

3 An approximation algorithm for IECMV(p) and OECMV(p)

In this section we present an approximation algorithm for both the IECMV(p) and OECMV(p)
problems, where p is any fixed constant. We provide two different analyses, one for
each problem, showing that the algorithm achieves a 2-approximation for both problems.

Given a set F of fingerprints, since p is a constant we are able (in $O(2^p n)$ time) to compute
the set $R = \{r_1, \ldots, r_k\}$ of all possible resolved fingerprints that are compatible with at least
one fingerprint in F. Given a resolved fingerprint r, we denote by s(r) the set of fingerprints
in F that are compatible with r, and by P(s(r)) the set of pairs of vectors in s(r). The
degree of a resolved fingerprint r, denoted by d(r), is defined as the cardinality of s(r).

The algorithm constructs a partition P of F greedily as follows: initially let P be an empty
set and let U be equal to F. At each iteration the algorithm computes the resolved fingerprint
r of maximum degree with respect to U (i.e. r is the resolved fingerprint compatible with the maximum number
of fingerprints in U), adds the set of fingerprints of U compatible with r as a set of the solution P and removes all these fingerprints
from U. The algorithm iterates this step until U is empty.
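The greedy procedure can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: fingerprints are strings over `0`, `1`, `N`, degrees are recomputed from scratch at each iteration, and ties between resolved vectors of equal degree are broken lexicographically (the text does not specify a tie-breaking rule).

```python
from itertools import product

def resolutions(v):
    choices = [("0", "1") if c == "N" else (c,) for c in v]
    return ["".join(bits) for bits in product(*choices)]

def greedy_partition(fingerprints):
    """Repeatedly pick the resolved vector compatible with most remaining fingerprints."""
    remaining = list(range(len(fingerprints)))   # indices of the fingerprints still in U
    partition = []
    while remaining:
        # members[r] = fingerprints of U compatible with r; a fingerprint is compatible
        # with a resolved vector r exactly when r is one of its resolutions
        members = {}
        for idx in remaining:
            for r in resolutions(fingerprints[idx]):
                members.setdefault(r, []).append(idx)
        # maximum degree first; lexicographic tie-breaking is an assumption of this sketch
        best = min(members, key=lambda r: (-len(members[r]), r))
        chosen = set(members[best])
        partition.append((best, [fingerprints[i] for i in members[best]]))
        remaining = [i for i in remaining if i not in chosen]
    return partition

print(greedy_partition(["0N", "N0", "01", "10"]))
# [('00', ['0N', 'N0']), ('01', ['01']), ('10', ['10'])]
```

Each iteration touches at most 2^p resolutions per remaining fingerprint, so with p constant the whole loop stays polynomial in n.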

3.1 Analysis for IECMV(p)

Let $S = \{s_1, \ldots, s_k\}$ be a solution of IECMV(p). The value of S is the number of pairs of compatible
fingerprint vectors co-clustered by S and is denoted by V(S). It holds that $V(S) = \sum_{i=1}^{k} |P(s_i)|$,
where $P(s_i)$ is the set of pairs of fingerprints in $s_i$. Generalizing this notion, we denote by P(S)
the set of all the pairs co-clustered in the partition S, that is $P(S) = \bigcup_{i=1}^{k} P(s_i)$. Let $W \subseteq U$ be
a subset of fingerprint vectors; we denote by P(S, W) the set of pairs (x, y) in P(S) such that
at least one of x, y is in W.

In the following we will show that the algorithm has approximation factor 2. The algorithm
computes a sequence $(r_1, \ldots, r_k)$ of resolved fingerprints, one at each iteration. At the i-th
iteration the algorithm constructs a set of the partition containing all fingerprints that are
compatible with $r_i$ and have not been put in a set of the partition during one of the previous iterations
(we will denote this set by $s_i$). For ease of analysis, we will denote by $U_i$ the set U at the
beginning of the i-th iteration; consequently $U_1 = F$ and $U_{i+1} = U_i \setminus s_i$ for $1 \le i \le k$, where k is
the number of sets in the output partition. Recall that the partition output by the algorithm is
denoted by $S = \{s_1, \ldots, s_k\}$. The optimal partition is denoted by $Opt = \{opt_1, \ldots, opt_l\}$, where
l can be different from k.






By definition, the value of the optimal solution is $|P(Opt)|$; our goal will be to show that
$|P(Opt)| \le 2|P(S)|$. We introduce some sets as follows: $P(Opt, 1) = P(Opt, s_1)$, and
$P(Opt, i+1) = P(Opt, s_{i+1}) \setminus \bigcup_{1 \le j \le i} P(Opt, j)$ for $1 \le i < k$. A fundamental property is that
$\{P(Opt, i) : 1 \le i \le k\}$ is a partition of $P(Opt)$.

In fact the sets $P(Opt, i)$ are disjoint by construction. Since $S = \{s_1, \ldots, s_k\}$ is a partition
of F, then $P(Opt) = \bigcup_{j} P(Opt, s_j)$. Let (x, y) be a pair of $P(Opt)$. W.l.o.g. we can assume that
$x \in s_i$, $y \in s_j$, with $i \le j$. Then $(x, y) \in P(Opt, s_i)$ and (x, y) does not belong to any $P(Opt, h)$
with $h < i$, since neither x nor y belongs to $s_h$; therefore $(x, y) \in P(Opt, i)$. Consequently the sets $P(Opt, i)$ are a partition of
$P(Opt)$, and the value of the optimal solution is equal to $\sum_{i=1}^{k} |P(Opt, i)|$.

Consequently, in order to prove that our greedy algorithm achieves a 2-approximation, it
suffices to show that, for each i, $|P(Opt, i)| \le 2|P(s_i)|$.

Lemma 3.1. Let $S = \{s_1, \ldots, s_k\}$ be the solution computed by the algorithm, and let Opt be an
optimal solution. Then $|P(Opt, i)| \le 2|P(s_i)|$ for $1 \le i \le k$.

Proof. Let $s_i$ be the set added to the solution S at the i-th step of the algorithm. All pairs
in $P(Opt, i)$ must belong to $U_i \times U_i$, by definition of $P(Opt, i)$. Each element x in $U_i$ is in the
same set of the optimal solution with at most $|s_i| - 1$ other fingerprints of $U_i$, for otherwise the
algorithm would not have chosen $s_i$ at the i-th iteration, but rather x and all fingerprints in $U_i$ that are
in the same set of the optimal solution. Since every pair in $P(Opt, i)$ contains at least one of the $|s_i|$ elements of $s_i$, there are at most $|s_i|(|s_i| - 1)$
pairs in $P(Opt, i)$, which completes the proof, since in $s_i$ there are $|s_i|(|s_i| - 1)/2$ pairs. □
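For completeness, the step from Lemma 3.1 to the factor 2 can be written out as a single chain. It uses only the partition property of the sets P(Opt, i), the bound of at most $|s_i|(|s_i| - 1)$ pairs from the proof, and the fact that the sets $s_i$ are disjoint:

```latex
\begin{align*}
|P(Opt)| \;=\; \sum_{i=1}^{k} |P(Opt, i)|
         \;\le\; \sum_{i=1}^{k} |s_i|\,(|s_i| - 1)
         \;=\; 2 \sum_{i=1}^{k} \frac{|s_i|\,(|s_i| - 1)}{2}
         \;=\; 2 \sum_{i=1}^{k} |P(s_i)|
         \;=\; 2\,|P(S)| .
\end{align*}
```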

It is easy to see that the approximation factor is tight. Consider three resolved vectors $r_1$,
$r_2$, $r_3$ and four fingerprint vectors $f_1, f_2, f_3, f_4$ such that $s(r_1) = \{f_1, f_2\}$, $s(r_2) = \{f_1, f_3\}$,
$s(r_3) = \{f_2, f_4\}$. All three resolved vectors have degree 2, so the approximation algorithm may choose $s(r_1)$ as the first set and then $\{f_3\}$, $\{f_4\}$
as the sets to complete the partition. Thus the value of the approximate solution is 1, since one
pair is selected. It is easy to see that the optimal solution consists of the sets $s(r_2) = \{f_1, f_3\}$ and
$s(r_3) = \{f_2, f_4\}$, thus having value 2.

3.2 Analysis for OECMV(p) 

The analysis in this section follows the one for IECMV(p). Let $S = \{s_1, \ldots, s_k\}$ be a solution
of OECMV(p). The value of S is the number of pairs of compatible fingerprint vectors that are not
co-clustered in S and is denoted by V(S). It holds that $V(S) = \frac{1}{2} \sum_{i=1}^{k} |L(s_i)|$, where $L(s_i)$ is the
set of pairs (x, y) of compatible fingerprints where exactly one of x and y is in $s_i$. Generalizing
this notion, we denote by L(S) the set of all unordered pairs of compatible fingerprints that are
not co-clustered in the partition S, that is $L(S) = \bigcup_{i=1}^{k} L(s_i)$. Notice also that each pair in L(S)
appears in exactly two sets $L(s_i)$, therefore $|L(S)| = \frac{1}{2} \sum_{i=1}^{k} |L(s_i)|$. Let $W \subseteq U$ be a subset of
fingerprint vectors; we denote by L(S, W) the set of pairs (x, y) in L(S) such that at least one
of x, y is in W.

In the following we will show that the algorithm has approximation factor 2. The algorithm
computes a sequence $(r_1, \ldots, r_k)$ of resolved fingerprints, one at each iteration. At the i-th
iteration the algorithm constructs a set of the partition containing all fingerprints that are
compatible with $r_i$ and have not been put in a set of the partition during one of the previous iterations
(we will denote this set by $s_i$). For ease of analysis, we will denote by $U_i$ the set U at the
beginning of the i-th iteration; consequently $U_1 = F$ and $U_{i+1} = U_i \setminus s_i$ for $1 \le i \le k$, where k is
the number of sets in the output partition. Recall that the partition output by the algorithm is
denoted by $S = \{s_1, \ldots, s_k\}$. The optimal partition is denoted by $Opt = \{opt_1, \ldots, opt_l\}$, where
l can be different from k.

By definition, the value of the optimal solution is $|L(Opt)|$; our goal will be to show that
$2|L(Opt)| \ge |L(S)|$. We introduce some sets as follows: $L(Opt, 1) = L(Opt, s_1)$, and
$L(Opt, i) = L(Opt, s_i) \setminus \bigcup_{j < i} L(Opt, j)$ for $1 < i \le k$. A fundamental property is that
$\{L(Opt, i) : 1 \le i \le k\}$ is a partition of $L(Opt)$. In fact the sets $L(Opt, i)$ are disjoint by construction. Since
$S = \{s_1, \ldots, s_k\}$ is a partition of F, then $L(Opt) = \bigcup_{i} L(Opt, s_i)$. Let (x, y) be a pair of
$L(Opt)$. W.l.o.g. we can assume that $x \in s_i$, $y \in s_j$, with $i \le j$. Then $(x, y) \in L(Opt, s_i)$ and (x, y)
does not belong to any $L(Opt, h)$ with $h < i$, therefore $(x, y) \in L(Opt, i)$. Consequently
the sets $L(Opt, i)$ are a partition of $L(Opt)$, and the value of the optimal solution is equal to
$\sum_{i=1}^{k} |L(Opt, i)|$.

Similarly we introduce the sets $L(S, 1) = L(s_1)$ and $L(S, i) = L(s_i) \setminus \bigcup_{j < i} L(S, j)$ for
$1 < i \le k$. A fundamental property is that $\{L(S, i) : 1 \le i \le k\}$ is a partition of $L(S)$ and thus
$|L(S)| = \sum_{i=1}^{k} |L(S, i)|$. Consequently, in order to prove that our greedy algorithm achieves a
2-approximation, it suffices to show that, for each i, $2|L(Opt, i)| \ge |L(S, i)|$.

Lemma 3.2. Let $S = \{s_1, \ldots, s_k\}$ be the solution computed by the algorithm, and let Opt be an
optimal solution. Then $2|L(Opt, i)| \ge |L(S, i)|$ for $1 \le i \le k$.

Proof. Let $s_i$ be the set added to the solution S at the i-th step of the algorithm. Given a
fingerprint $x \in s_i$, we define C(x) as the set of all fingerprints in $U_i$ that are compatible with x,
and D(x) as the set of pairs of $L(Opt, i)$ formed by x and an element of C(x), that is the pairs that are not co-clustered in Opt.
Since x is clustered with $|s_i| - 1$ elements of $U_i$ in S, there are exactly $|C(x)| - |s_i| + 1$ pairs in
$L(S, i)$ containing x. It follows that $|L(S, i)| = \sum_{x \in s_i} (|C(x)| - |s_i| + 1)$.

All pairs in $L(Opt, i)$ must belong to $(U_i \setminus U_{i+1}) \times U_i$, by definition of $L(Opt, i)$ (for simplicity,
we will assume that $U_{k+1} = \emptyset$). Notice that, by construction of $s_i$, $|D(x)| \ge |C(x)| - |s_i| + 1$.

Clearly $L(Opt, i) = \bigcup_{x \in U_i} D(x)$, by definition of D(x). Since each pair $(y, z) \in L(Opt, i)$
appears only in the two (not necessarily distinct) sets D(y) and D(z), then
$|L(Opt, i)| \ge \frac{1}{2} \sum_{x \in U_i} |D(x)| \ge \frac{1}{2} \sum_{x \in U_i} (|C(x)| - |s_i| + 1)$ and the proof is completed. □

It is easy to see that also in this case the approximation factor is tight. Consider three
resolved vectors $r_1$, $r_2$, $r_3$ and four fingerprint vectors $f_1, f_2, f_3, f_4$ such that $s(r_1) = \{f_1, f_2\}$,
$s(r_2) = \{f_1, f_3\}$, $s(r_3) = \{f_2, f_4\}$. The approximation algorithm may choose $s(r_1)$ as the first set
and then $\{f_3\}$, $\{f_4\}$ as the sets to complete the partition. Thus the value of the approximate
solution is 2, since the pairs of compatible fingerprint vectors that are not co-clustered are
$(f_1, f_3)$ and $(f_2, f_4)$. It is easy to see that the optimal solution consists of the sets $s(r_2) = \{f_1, f_3\}$ and
$s(r_3) = \{f_2, f_4\}$; hence the only pair of compatible fingerprint vectors not co-clustered in the
optimal solution is $(f_1, f_2)$ and the cost of the optimal solution is 1.
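The tight example can be checked by brute force on a concrete instance. The four fingerprints below are a hypothetical realization chosen for this sketch (they are not given in the paper): with f1 = 0N, f2 = N0, f3 = 01, f4 = 10 and r1 = 00, r2 = 01, r3 = 10, the sets s(r1), s(r2), s(r3) are exactly those of the example. The code enumerates every partition into pairwise compatible clusters and compares the optimum with the partition obtained when the greedy algorithm happens to pick r1 first.

```python
from itertools import combinations

def compatible(v1, v2):
    return all(a == b or "N" in (a, b) for a, b in zip(v1, v2))

def valid_partitions(items):
    """Yield every partition of items into pairwise compatible clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in valid_partitions(rest):
        # put `first` into an existing cluster it is compatible with ...
        for i, cluster in enumerate(partial):
            if all(compatible(first, v) for v in cluster):
                yield partial[:i] + [[first] + cluster] + partial[i + 1:]
        # ... or open a new cluster for it
        yield [[first]] + partial

def inside_pairs(partition):
    return sum(len(c) * (len(c) - 1) // 2 for c in partition)

F = ["0N", "N0", "01", "10"]               # hypothetical f1, f2, f3, f4
total = sum(compatible(a, b) for a, b in combinations(F, 2))
best_inside = max(inside_pairs(p) for p in valid_partitions(F))
greedy = [["0N", "N0"], ["01"], ["10"]]    # partition produced when r1 = 00 is picked first

print(inside_pairs(greedy), best_inside)                   # IECMV: 1 2
print(total - inside_pairs(greedy), total - best_inside)   # OECMV: 2 1
```

Both ratios equal 2, matching the values 1 versus 2 (IECMV) and 2 versus 1 (OECMV) stated in the two tightness arguments.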

4 APX-hardness of CMV(2) 

In this section we will prove that CMV(p) is APX-hard via an L-reduction from minimum vertex
cover on cubic graphs, whose APX-hardness has already been proved.

Figure 1: A vertex gadget $VG_i$

Figure 2: An edge gadget $EG_{ij}$
In particular we will combine two L-reductions: (1) from minimum vertex cover on a graph G
to minimum vertex cover on a gadget graph; (2) from minimum vertex cover on a gadget graph
to CMV(p).

First Reduction 

First we will define gadget graphs. Given a graph G = (V, E), for each vertex $v_i$ we define a
vertex gadget $VG_i$ consisting of 5 vertices. Three vertices, $c_{i1}$, $c_{i4}$, $c_{i5}$, are called docking vertices.
Observe that the minimum vertex cover of a vertex gadget consists of 2 vertices, $c_{i2}$, $c_{i3}$; we
denote this cover as type 1. Observe that there is a cover of $VG_i$ consisting of 3 vertices, $c_{i1}$, $c_{i4}$,
$c_{i5}$; we denote this cover as type 2.

For each edge $(v_i, v_j)$ we define an edge gadget $EG_{ij}$ joining the vertex gadgets $VG_i$, $VG_j$ in
two of their docking vertices. An edge gadget consists of six vertices: the two docking vertices
shared with the vertex gadgets and four other vertices.

Theorem 4.1. Let $C \subseteq V$ be a cover of G, with $|C| = k$. Then there is a cover of the gadget
graph of size $3k + 2(n - k) + 2m$.

Proof. Consider a vertex $v_i$ in C and associate with the corresponding vertex gadget $VG_i$ a cover
of type 2 (of size 3). For each vertex $v_j \notin C$, associate with the corresponding vertex gadget
$VG_j$ a cover of type 1 (of size 2). Observe that for each edge gadget at least one of the adjacent
vertex gadgets has a type 2 cover. Thus we just need to pick two more vertices for each edge gadget
to obtain a cover of each edge gadget and thus of the entire gadget graph. □

Lemma 4.2. Let C be a cover of the gadget graph of size $3k + 2(n - k) + 2m$. Then we can
compute in polynomial time a vertex cover of size at most $3k + 2(n - k) + 2m$ such that every vertex gadget has
a cover of type 1 or type 2 and such that for each pair of adjacent vertex gadgets at least one
has a cover of type 2.

Proof. It is easy to see that if a vertex gadget does not have a cover of type 1, we can substitute its
cover with a cover of type 2, obtaining a solution that is not larger. Now assume
that two adjacent vertex gadgets $VG_i$, $VG_j$ both have a cover of type 1. Then observe that the
edge gadget $EG_{ij}$ must be covered with at least 4 vertices. By covering $VG_j$ with a cover of type
2, the edge gadget $EG_{ij}$ needs to be covered with just 2 vertices, thus obtaining a cover of size
less than $3k + 2(n - k) + 2m$. □

Theorem 4.3. Let C be a cover of the gadget graph of size $3k + 2(n - k) + 2m$. Then there is
a cover of the graph G of size k.

Proof. Consider a vertex cover of size $3k + 2(n - k) + 2m$. By the previous lemma we can
construct a solution of size at most $3k + 2(n - k) + 2m$ such that for each edge gadget $EG_{ij}$
at least one of $VG_i$, $VG_j$ has a cover of type 2. Thus, we can define a cover for G by taking all the vertices
corresponding to vertex gadgets with a cover of type 2. Since there are at most k vertices of this
kind, the theorem follows. □

Since a vertex cover of a cubic graph contains at least $|V|/4$ vertices and $|E| = \frac{3}{2}|V|$, it
follows that the above reduction is an L-reduction.
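The constants of this L-reduction are not spelled out in the text. One way to obtain them from Theorems 4.1 and 4.3 is sketched below, writing $k^*$ for the size of a minimum vertex cover of G and $opt'$ for the size of a minimum cover of the gadget graph (both symbols are introduced here only for illustration), with $n = |V|$, $m = |E|$, and taking the stated lower bound of $|V|/4$ on the size of a vertex cover as given.

```latex
\begin{align*}
opt' &= 3k^* + 2(n - k^*) + 2m \;=\; k^* + 2n + 2m \;=\; k^* + 5n
      && \text{since } m = \tfrac{3}{2}\,n, \\
opt' &\le k^* + 20\,k^* \;=\; 21\,k^*
      && \text{since } n \le 4k^*, \\
k - k^* &= (k + 5n) - (k^* + 5n)
      && \text{for a gadget cover of size } k + 5n \text{ mapped back to a cover of } G \text{ of size } k .
\end{align*}
```

Under these assumptions the two conditions of an L-reduction hold with constants 21 and 1.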



Second Reduction 

Now we reduce minimum vertex cover on gadget graphs to CMV(p). The idea in our description is
that it is possible to assign a resolved fingerprint to each vertex and an unresolved fingerprint to
each edge. The set of unresolved fingerprints will be our instance of CMV(p), and all interesting
solutions will pick their resolved fingerprints from those assigned to the vertices. More precisely,
we will show that each unresolved fingerprint (assigned to an edge) will be resolved to the
fingerprint assigned to one of the endpoints of that edge.

Let n denote the number of vertex gadgets. Each fingerprint consists of n chunks of 7
positions, and the resolved vector of each vertex in the vertex gadget $VG_i$ consists only of 0s, except for the i-th
chunk. Denote by $v(c_{i1})$, $v(c_{i2})$, $v(c_{i3})$, $v(c_{i4})$ and $v(c_{i5})$ the resolved vectors associated with
the vertices of $VG_i$, and define the i-th chunk of these vectors as follows: $v(c_{i1})$: 1110000, $v(c_{i2})$:
1111100, $v(c_{i3})$: 1110011, $v(c_{i4})$: 1001100, $v(c_{i5})$: 1000011. For example, the vertex $c_{i3}$
of the i-th vertex gadget has fingerprint $0^{7(i-1)}\,1110011\,0^{7(n-i)}$.

The vertices belonging exclusively to an edge gadget will have two chunks that are not
completely made of 0s. More precisely, let $VG_i$ and $VG_j$ be two adjacent vertex gadgets; we
denote by $v(e_{ij,1})$, $v(e_{ij,2})$, $v(e_{ij,3})$, $v(e_{ij,4})$ the resolved vectors associated with the vertices
of the edge gadget $EG_{ij}$. Only the i-th and the j-th chunks do not consist entirely of
0s, and those chunks are represented in Table 1.

Table 1: Possible values of fingerprints for an edge gadget

Assume that $e_{ij,1}$, $e_{ij,3}$ are adjacent to a docking vertex of $VG_i$, and that $e_{ij,2}$, $e_{ij,4}$ are adjacent
to a docking vertex of $VG_j$. These resolved vectors are defined as in Table 1.

Next we discuss the properties of the resolved vectors defined above. Each pair of resolved
vectors associated with adjacent vertices has Hamming distance 2. Each pair of resolved vectors
associated with non-adjacent vertices has Hamming distance at least 3.

Now we construct the instance of the problem, that is the fingerprint vectors. We associate
a fingerprint with each edge of the gadget graph. Let y = (a, b) be an edge of the gadget
graph, and let $v_a$ and $v_b$ be the resolved vectors associated with vertices a and b respectively; we associate
with y the fingerprint vector $v_y$ defined as follows: for each position l such that $v_a[l] = v_b[l]$, we set
$v_y[l] := v_a[l]$; for each position l such that $v_a[l] \ne v_b[l]$, we set $v_y[l] := N$.
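The edge-to-fingerprint construction is easy to try out on the chunks of a single vertex gadget. The sketch below uses the five 7-bit chunks listed earlier for the vertices of a vertex gadget; as an assumption of this example, the edges inside the gadget are taken to be exactly the pairs of vertices whose vectors are at Hamming distance 2, which is how adjacency is characterized in the text. The function names are illustrative.

```python
def edge_fingerprint(va: str, vb: str) -> str:
    """Fingerprint of an edge: keep agreeing positions, put 'N' where the endpoints differ."""
    return "".join(a if a == b else "N" for a, b in zip(va, vb))

def hamming(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))

# i-th chunks of the resolved vectors of the five vertices of a vertex gadget
chunk = {1: "1110000", 2: "1111100", 3: "1110011", 4: "1001100", 5: "1000011"}

# Assumption: adjacent vertices are exactly the pairs at Hamming distance 2
edges = [(a, b) for a in chunk for b in chunk if a < b and hamming(chunk[a], chunk[b]) == 2]
fingerprints = {e: edge_fingerprint(chunk[e[0]], chunk[e[1]]) for e in edges}

for e, f in fingerprints.items():
    print(e, f)
# (1, 2) 111NN00
# (1, 3) 11100NN
# (2, 4) 1NN1100
# (3, 5) 1NN0011
```

Every fingerprint produced this way has exactly two Ns, one for each position where the two endpoint vectors differ.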

Lemma 4.4. Each fingerprint vector has exactly two positions with value N. 

Proof. By construction, the two resolved vectors associated with the endpoints of an edge differ
in exactly two positions. Thus, the fingerprint vector associated with that edge has value N in
exactly those positions. □

A fundamental property of the instance of CMV(p) is the following: 

Lemma 4.5. Two fingerprint vectors can have a common resolution only if the edges encoded
by such fingerprints share a common vertex.

Proof. First observe that by construction each fingerprint vector $f_i$ can have at most 4 resolutions.
Moreover, if $r_{i_1}$ and $r_{i_2}$ are resolutions of $f_i$ having Hamming distance 2, any other
resolution has Hamming distance 1 from both $r_{i_1}$ and $r_{i_2}$. Let $f_i$ be a fingerprint vector
encoding edge $e_i = (i_1, i_2)$ and let $f_j$ be a fingerprint vector encoding edge $e_j = (j_1, j_2)$. There is
at least one pair of resolved vectors associated with the endpoints of $e_i$ and $e_j$ having Hamming
distance at least 3; assume w.l.o.g. those vectors are $r(i_1)$ and $r(j_1)$. Note that none of $r(i_1)$ and
$r(j_1)$ can be a common resolution for both $f_i$ and $f_j$. Any resolution $r_i^*$ of $f_i$ different from $r(i_1)$
and $r(i_2)$ has Hamming distance 1 from $r(i_1)$. Similarly, any resolution $r_j^*$ of $f_j$ different from
$r(j_1)$ and $r(j_2)$ has Hamming distance 1 from $r(j_1)$. Thus, by the triangle inequality, $r_i^*$ and $r_j^*$ have Hamming distance
at least 1 and thus are not identical. It follows that none of $r_i^*$ and $r_j^*$ can be a common
resolution for both $f_i$ and $f_j$. Thus $f_i$ and $f_j$ have a common resolution only if $r(i_2)$ and $r(j_2)$
are the same vector, that is they encode the same vertex. □

Theorem 4.6. Let C be a cover of the gadget graph of size $3k + 2(n - k) + 2m$. Then there is
a solution of CMV(p) of size $3k + 2(n - k) + 2m$.

Proof. Consider a vertex cover of size $3k + 2(n - k) + 2m$. We can define a solution of
CMV(p) by taking as resolutions the resolved vectors associated with the vertices of the cover. □

Theorem 4.7. Let C be a solution of CMV(p) of size $3k + 2(n - k) + 2m$. Then there is a cover
of the gadget graph of size $3k + 2(n - k) + 2m$.

Proof. Consider a solution for CMV(p). If a fingerprint vector is associated with a resolved 
vector not associated with a vertex of the gadget graph, then this resolution is not common 
to any other fingerprint vector of the instance. Thus, we can replace it with a resolved vector 
associated with a vertex of the graph without increasing the size of the solution. Then for each 
resolution chosen, add the corresponding vertex to the cover of the gadget graph. □ 

It is easy to see that this second reduction is also an L-reduction.

5 MAX-SNP hardness of IECMV(2) 

In this section we prove that IECMV(p) is MAX-SNP hard via an L-reduction from
Maximum Independent Set on Cubic Graphs (MISCG). Let G = (V, E) be a cubic graph; the
MISCG problem asks for a subset $V' \subseteq V$ of maximum cardinality such that no two vertices in V'
are adjacent.

We associate with a vertex $v_i$ of V a set of 9 fingerprint vectors. First we introduce a set
of 8 resolved vectors, $C_i = \{c_{i1}, c_{i2}, c_{i3}, c_{i4}, c_{i5}, c_{i6}, c_{i7}, c_{i8}\}$, such that the resolved vectors in $C_i$
are possible resolutions of the fingerprint vectors. We represent this situation through a graph,
denoted as compatibility graph $CG_i$, such that the resolved vectors in $C_i$ are the vertices of $CG_i$,
while the fingerprint vectors are the edges of $CG_i$. A fingerprint vector associated with an edge
$(c_{ix}, c_{iy})$ can be resolved by both $c_{ix}$ and $c_{iy}$ and by no other resolved vector in $C = \bigcup_i C_i$.
Three vertices of $CG_i$ are called docking vertices.

For each edge $e = (v_i, v_j) \in E$, define a fingerprint vector that is compatible with a resolved
vector associated with a docking vertex of $CG_i$ and a resolved vector associated with a docking
vertex of $CG_j$. We represent this fingerprint vector in the graph as an edge $E_{i,j}$ that joins
the compatibility graphs $CG_i$ and $CG_j$. The graph obtained will be
denoted as CG.

Figure 3: A compatibility graph $CG_i$

Assume that $|V| = n$ and $|E| = m$. The complete vectors of the instance of IECMV(p) have
length 5n; 5 positions are associated with each vertex. Assume w.l.o.g. that vertex $v_i$ is adjacent
to vertices $v_j$, $v_h$ and $v_k$, and in particular that each of the three docking vertices of $CG_i$ is adjacent to exactly one of
$CG_j$, $CG_h$ and $CG_k$. Complete vectors associated with $CG_i$ are defined as follows:

• the docking vertex adjacent to $CG_j$ has value 1 in position $5j - 4$, the docking vertex adjacent to $CG_h$ has value 1 in position $5h - 4$, and the docking vertex adjacent to $CG_k$ has value
1 in position $5k - 4$.

• for any other position not in $[5i - 4, 5i]$ all the complete vectors associated with $CG_i$ have
value 0.

• for the positions in $[5i - 4, 5i]$, $c_{i1} = 11000$, $c_{i2} = 11010$, $c_{i3} = 10010$, $c_{i4} = 11100$,
$c_{i5} = 10110$, $c_{i6} = 11110$, $c_{i7} = 11011$, $c_{i8} = 10100$.

Let R be the set of the resolved vectors associated with vertices of the graph. Now we
construct the instance of the problem, that is the fingerprint vectors. We associate a fingerprint
vector with each edge of the graph. Let y = (a, b)
be an edge of the graph, and let $v_a$ and $v_b$ be the resolved vectors associated with a and b respectively;
we associate with y the fingerprint vector $v_y$ defined as follows: for each position l such that $v_a[l] = v_b[l]$,
we set $v_y[l] := v_a[l]$; for each position l such that $v_a[l] \ne v_b[l]$, we set $v_y[l] := N$.

It is easy to see that each fingerprint vector will have at most 2 positions having value N,
since two resolved vectors associated with adjacent vertices have Hamming distance at most
2.

Lemma 5.1. Let S be a solution of IECMV(p). Then there is a solution S' having at least the
same cost and such that each resolved vector of the solution is a resolved vector in R.

Proof. Let $f_x$, $f_y$ be two fingerprint vectors; they are compatible if and only if they are associated
with two edges incident on a common vertex. Moreover, observe that there exists a unique
resolved vector that can be a common resolution of both $f_x$ and $f_y$, unless they are associated
with edges incident on the same docking vertex $c_z$. In this case they can have two common
resolutions, $r_{z_1}$ and $r_{z_2}$. Assume that $r_{z_1}$ is associated with $c_z$; there is a single position l not
in $[5z - 4, 5z]$ where $r_{z_1}$ has value 1. $r_{z_2}$ is the resolved vector having a 0 in position l and equal
to $r_{z_1}$ in any other position. Since no other fingerprint vector is compatible with $r_{z_2}$, it follows that we
can substitute $r_{z_2}$ with $r_{z_1}$ without decreasing the cost of the solution. □

Figure 4: A compatibility graph $E_{ij}$

Thus we can restrict to the solutions where each set $s_v$ corresponds to a resolved vector
associated with a vertex v of the graph CG and the fingerprint vectors associated with (some)
edges incident on v are assigned to $s_v$. In what follows we show that for a solution of IECMV(p)
on a compatibility graph $CG_i$ we can restrict to the following cases:

• Solution A: 9 pairs of fingerprint vectors are co-clustered; this means that three resolved vectors of $C_i$
are the resolved vectors of the solution.

• Solution B: 4 pairs of fingerprint vectors are co-clustered; this means that four resolved vectors of $C_i$
are the resolved vectors of the solution.

Lemma 5.2. Solution B is the maximum solution that has 1 pair for each of the docking vertices
of $CG_i$.

Proof. Let Z be a solution such that the sets associated with the resolved vectors of the three docking vertices
all have one pair. It is easy to see that exactly one other set, associated with a single resolved vector of $C_i$, can
have more than one element. Thus the lemma follows. □

Let Z be a solution of IECMV(p) for $CG_i$ such that it has one set $s_x$ associated with a
resolved vector x of a docking vertex. The set $s_x$ will contain two fingerprint vectors. If we
assign the fingerprint vector associated with the edge $E_{ij}$ incident on x to $s_x$, we gain 2 pairs.
If we have a solution A for a compatibility graph $CG_i$ and we assign the fingerprint vector of
$E_{ij}$ to a docking vertex, we gain 0 pairs. Note that if two adjacent compatibility graphs have as solutions
the sets corresponding to the two docking vertices, it follows that only one of these sets can
gain pairs. Next we show that, gaining pairs from $E_{ij}$, no solution different from solution B
can become better than solution A. Let Z be a solution of IECMV(p) different from solution A
and solution B. If exactly one of the sets of Z corresponds to a docking vertex, it follows that it
can have at most 6 pairs. In fact, the optimal solution in this case has one set with 3 pairs and
three sets each with one pair. If exactly two of the sets of Z correspond to docking vertices,
it follows that it can have at most 4 pairs. In fact, the optimal solution in this case has four
sets, each with one pair. Since no other solution can gain pairs from $E_{ij}$, it follows that
no solution except solution B can become better than solution A.

Thus the optimal solution for $CG_i$ and $E_{ij}$, $E_{ih}$, $E_{ik}$ is to have solution B for $CG_i$
and add the fingerprint vectors associated with $E_{ij}$, $E_{ih}$, $E_{ik}$ to the sets corresponding to
the docking vertices. Each of these sets will have three elements, thus 3 pairs, and the solution
has 10 pairs. In what follows we will refer to this extended solution as solution B. Moreover, any
solution different from the one constructed above is worse than solution A. It follows
that the problem of maximizing the number of co-clustered pairs of fingerprint vectors consists
of building an independent set of compatibility graphs (each one associated with solution B).

Lemma 5.3. There exists an independent set of size k if and only if there exists a solution of
IECMV(p) having at least $10k + 9(n - k)$ pairs.

Proof. Let V' be an independent set of G such that $|V'| = k$, and construct a solution S of IECMV(p)
such that the component graphs associated with vertices in V' have a solution of type B and
any other component graph has a solution of type A. Then it follows that $c(S) = 10k + 9(n - k)$.

Now let S* be a solution with cost $10k + 9(n - k)$. We can construct a solution having
at least the same cost by defining, for each component graph that has a cost less than 10, a type A
solution. Since the component graphs having cost 10 must not be adjacent, at least k pairwise non-adjacent
component graphs must have a type B solution in S*, and thus the corresponding vertices are an
independent set of size k. □

Since for each cubic graph $|E| = \frac{3}{2}|V|$ and there exists an independent set of size at least
$|V|/4$, it follows that the above reduction is an L-reduction.
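The constants are again left implicit. A possible derivation from Lemma 5.3 is sketched below, writing $k^*$ for the size of a maximum independent set of G and $opt_{IE}$ for the optimal value of the constructed IECMV(p) instance (both symbols are introduced here only for illustration), and using the stated bound $k^* \ge |V|/4 = n/4$.

```latex
\begin{align*}
opt_{IE} &= 10k^* + 9(n - k^*) \;=\; 9n + k^*
          && \text{by Lemma 5.3}, \\
opt_{IE} &\le 36\,k^* + k^* \;=\; 37\,k^*
          && \text{since } n \le 4k^*, \\
k^* - k &= (9n + k^*) - (9n + k)
          && \text{for a solution with } 9n + k \text{ pairs mapped back to an independent set of size } k .
\end{align*}
```

Under these assumptions the two conditions of an L-reduction hold with constants 37 and 1.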

5.1 MAX-SNP hardness of OECMV(2) 

It is easy to see that the L-reduction described above to prove the MAX-SNP hardness of
IECMV(p) can also be used to prove the MAX-SNP hardness of OECMV(p). Note that, considering
the set of fingerprint vectors associated with a component graph $CG_i$ and with the edges
$E_{ij}$, $E_{ih}$, $E_{ik}$, we can have 19 compatible pairs of fingerprint vectors. As in the previous
reduction, the best solution for this set of fingerprint vectors is the type B solution. Since the type
B solution co-clusters 10 pairs of compatible fingerprint vectors, it follows that it does not
co-cluster 19 - 10 = 9 pairs of compatible fingerprint vectors. Similarly, the type A solution does not
co-cluster 19 - 9 = 10 pairs of compatible vectors and no other solution different from the type B
solution is better than the type A solution. Hence the L-reduction for OECMV(p) follows directly
from the L-reduction for IECMV(p).